
Conversation

@jaredoconnell (Collaborator)
Summary

This PR re-enables, tests, and documents the preprocess dataset command. It also changes the format in which prompt and output sizes are specified, and makes the code aware of prefixes.

Details

  • Uses the post-refactor code to re-enable the command.
  • Switches the data config over to the same format used by benchmark run's synthetic data, to enable more features and make the command more cohesive with the rest of GuideLLM (see the sketch below).
  • Adds options for prefixes, including an option to include prefix tokens in the counts, since prefixes are included in input and output tokens and affect performance.
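For illustration only, a minimal sketch of what such a data config could look like; the field names here are assumptions drawn from the review discussion below, not the merged format:

```yaml
# Hypothetical preprocess data config; field names are assumed, not verified.
prompt_tokens: 256         # target prompt length in tokens
prompt_tokens_stdev: 32    # optional spread around the target
output_tokens: 128         # target output length in tokens
prefix_tokens: 10          # per-row prefix length in tokens
```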

Test Plan

  • Run with a known dataset, or create one as a simple CSV (see the sketch after this list).
  • New tests are added that should cover everything except Hugging Face uploads. They were generated at least in part by AI, but I went through each one iteratively to ensure it does what it needs to do.

  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Commit: Also includes some new behavior
Assisted-by: Cursor AI
Generated-by: Cursor AI
Signed-off-by: Jared O'Connell <[email protected]>
@sjmonson (Collaborator) left a comment


I think the reuse of SyntheticTextDatasetConfig is not a great idea. prefix_buckets are defined differently here. For synthetic data a prefix bucket of prefix_tokens=10,prefix_count=1 means you get one identical prefix for the entire dataset. As implemented here prefix_tokens=10,prefix_count=1 will only ensure that every row has a prefix of length 10. It does not guarantee any shared prefix between rows.
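To make the difference concrete, a toy sketch (token lists stand in for real tokenization; this is not GuideLLM code):

```python
import random

vocab = [f"tok{i}" for i in range(100)]

# Synthetic-data semantics (prefix_tokens=10, prefix_count=1): one prefix is
# sampled once and shared by every row, so prefix caching can take effect.
shared_prefix = random.choices(vocab, k=10)
synthetic_rows = [shared_prefix + [f"prompt_{i}"] for i in range(3)]

# Semantics as implemented in this PR: each row independently receives a
# prefix of length 10, with no shared prefix guaranteed between rows.
per_row_rows = [random.choices(vocab, k=10) + [f"prompt_{i}"] for i in range(3)]
```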

Rather than reusing SyntheticTextDatasetConfig, I think the best option is to create a new config format that is similar only where it makes sense. For example:

```yaml
prompt_tokens:
prompt_tokens_stdev:
prompt_tokens_min:
prompt_tokens_max:
output_tokens:
output_tokens_stdev:
output_tokens_min:
output_tokens_max:
prefix_tokens:
prefix_tokens_stdev:
prefix_tokens_min:
prefix_tokens_max:
```

Treat prefix the same as prompt and output.
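As a sketch, that proposal might look like the following Pydantic model (hypothetical; GuideLLM's actual config classes may differ):

```python
from pydantic import BaseModel


class PreprocessDatasetConfig(BaseModel):
    """Hypothetical config treating prefix, prompt, and output uniformly."""

    prompt_tokens: int
    prompt_tokens_stdev: float | None = None
    prompt_tokens_min: int | None = None
    prompt_tokens_max: int | None = None
    output_tokens: int
    output_tokens_stdev: float | None = None
    output_tokens_min: int | None = None
    output_tokens_max: int | None = None
    prefix_tokens: int = 0
    prefix_tokens_stdev: float | None = None
    prefix_tokens_min: int | None = None
    prefix_tokens_max: int | None = None
```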

ERROR = "error"


def handle_ignore_strategy(
Collaborator:

For better organization I think these handle_*_strategy methods should be defined on a static class.
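For instance, a minimal sketch of what that static grouping could look like (names, signatures, and bodies are assumptions, not the PR's code):

```python
class ShortPromptStrategyHandlers:
    """Hypothetical static grouping of the handle_*_strategy functions."""

    IGNORE = "ignore"
    ERROR = "error"

    @staticmethod
    def handle_ignore_strategy(prompt: str, min_tokens: int) -> str | None:
        # Drop prompts that are too short.
        return prompt if len(prompt.split()) >= min_tokens else None

    @staticmethod
    def handle_error_strategy(prompt: str, min_tokens: int) -> str:
        # Fail fast on prompts that are too short.
        if len(prompt.split()) < min_tokens:
            raise ValueError(f"Prompt has fewer than {min_tokens} tokens")
        return prompt
```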

@jaredoconnell (Collaborator, Author):

Done.

I used a separate config for preprocess, but it inherits several fields from a new class shared with the synthetic config. I did this so that the relevant fields are shared, lowering complexity.
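Roughly the shape described, as a sketch (class and field names are assumptions; the actual names in the PR may differ):

```python
from pydantic import BaseModel


class SharedTextDatasetConfig(BaseModel):
    """Hypothetical base holding fields common to synthetic and preprocess."""

    prompt_tokens: int
    output_tokens: int
    prefix_tokens: int = 0


class SyntheticTextDatasetConfig(SharedTextDatasetConfig):
    # Synthetic-only knobs, e.g. shared-prefix bucketing, live here.
    prefix_count: int = 1


class PreprocessDatasetConfig(SharedTextDatasetConfig):
    # Preprocess-only options live here.
    pass
```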

Commit: Moved short prompt strategies to a static class
Assisted-by: Cursor AI
Signed-off-by: Jared O'Connell <[email protected]>
@jaredoconnell (Collaborator, Author)

I moved it to its own class in a way that retains a single source of truth, so we can use the same documentation. I simplified it to only offer the option of trimming prefixes. I decided that randomized size sampling wouldn't make sense, because that's not how prefixes typically behave in real scenarios.
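For example, a toy sketch of prefix trimming to a fixed token budget (assumed behavior, not the PR's implementation):

```python
def trim_prefix(prefix_token_ids: list[int], max_prefix_tokens: int) -> list[int]:
    # Keep only the first max_prefix_tokens tokens of the prefix;
    # no resampling or random sizing, just deterministic trimming.
    return prefix_token_ids[:max_prefix_tokens]
```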

Commit: Use separate class for preprocess config
Signed-off-by: Jared O'Connell <[email protected]>
@sjmonson (Collaborator) left a comment


Need to double-check that benchmark run is unaffected, but LGTM.

@markurtz (Collaborator) left a comment


Overall looks good. My only comment: could we move the bulk of what's currently in entrypoints.py into a separate file or package? That way the file stays dedicated to just the intended exposed APIs, and the classes and logic functions are nested under a sub-namespace.

